Tidyverse

Md. Zulquar Nain

What is Tidyverse?

  • The Tidyverse is a collection of R packages for data science that share a common design philosophy, grammar, and data structures.

  • It makes it easy to install and load core packages with a single command.

  • It includes packages that are used in everyday data analysis.

Why use Tidyverse?

  • Consistency: Consistent syntax and data structures across packages.

  • Ease of Use: Simple, intuitive functions.

  • Performance: Efficient and optimized for modern data science.

  • Versatility: Suited to a wide range of data manipulation and visualization tasks.

  • Comprehensive: Complete package for all stages of data science workflows.

Key Packages in Tidyverse

  • ggplot2: Data visualization.
  • dplyr: Data manipulation.
  • tidyr: Data tidying.
  • readr: Data input.
  • purrr: Functional programming.
  • tibble: Data frames.
  • stringr: String manipulation.
  • forcats: Factor handling.
  • lubridate: Date and time manipulation.

Installation and Setup

  • To install Tidyverse: install.packages("tidyverse")

  • To load Tidyverse into your current R session: library(tidyverse)

Pipe Operator (%>%)

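  • The pipe operator, either %>% (from the magrittr package, made available when the tidyverse is loaded) or |> (the native pipe, available since R 4.1.0), passes the result of one function directly to the next function in a sequence, making your code more readable and intuitive.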

  • Pipes allow you to write operations sequentially, enhancing clarity and reducing complexity.

  • Syntax:

    data %>% function1() %>% function2() %>% function3()

  • data: The input data is passed through the pipe.

  • function1(), function2(), function3(): These functions operate on the data in sequence, with the output of one function feeding directly into the next.

  • The shortcut to type the pipe operator in RStudio is given by CTRL/CMD + Shift + M.
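To make the contrast concrete, here is a minimal sketch comparing a nested call with the equivalent piped version. The vector x and the chosen functions (sqrt(), mean(), round()) are illustrative assumptions, not taken from the slides; the native pipe |> needs R 4.1.0 or later.

```r
# Illustrative input (hypothetical values)
x <- c(1, 4, 9, 16)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: reads left to right, one step at a time
piped <- x |> sqrt() |> mean() |> round(1)

# Both forms compute the same value
identical(nested, piped)
```

The piped form lists the steps in the order they are executed, which is the main readability gain.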

Example of pipe operators

Using the function sample(), we randomly draw (without replacement) 5 numbers between 1 and 20 (1:20) and compute the log of the resulting vector.

# without pipe operator
set.seed(44)
x = sample(1:20, 5)
x
[1] 17 11  1 20  5
log(x)
[1] 2.833213 2.397895 0.000000 2.995732 1.609438
#with pipe operator
x |> log()
[1] 2.833213 2.397895 0.000000 2.995732 1.609438
# with %>% the parentheses can be omitted when there are no further arguments (x %>% log);
# the native pipe |> always requires them (x |> log())
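A short sketch of how the two pipes treat parentheses, assuming the magrittr package is installed. The vector reuses the values printed in the example; the object names a, b and c_native are hypothetical.

```r
library(magrittr)  # provides %>% (dplyr and other tidyverse packages re-export it)

x <- c(17, 11, 1, 20, 5)

a <- x %>% log          # magrittr pipe: a bare function name is accepted
b <- x %>% log()        # same result with parentheses
c_native <- x |> log()  # native pipe: the right-hand side must be a function call

all.equal(a, b)
all.equal(a, c_native)
```

Writing the parentheses in both cases keeps code portable between the two pipes.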

Package dplyr

Data Manipulation with dplyr

  • dplyr is designed for easy data manipulation using verbs that describe the operations you want to perform. These common functions are:

  • filter() : Select rows based on conditions.

  • select() : Choose columns to keep.

  • mutate(): Add new columns or modify existing ones.

  • arrange(): Sort the data.

  • summarize(): Reduce multiple values down to a single summary.

  • group_by(): Group rows so that subsequent operations are performed by group.

dplyr verbs

We use the diamonds dataset from ggplot2, which contains the prices and other attributes of almost 54,000 diamonds (see ?diamonds).

# understand the dataset of diamonds
library(ggplot2)
class(diamonds)
[1] "tbl_df"     "tbl"        "data.frame"
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# we can also use the tidyverse function glimpse()
library(dplyr)
glimpse(diamonds)
Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

select()

diamonds %>% select(carat)
# A tibble: 53,940 × 1
   carat
   <dbl>
 1  0.23
 2  0.21
 3  0.23
 4  0.29
 5  0.31
 6  0.24
 7  0.24
 8  0.26
 9  0.22
10  0.23
# ℹ 53,930 more rows
  • Selecting more than one column:
diamonds %>% select(carat, cut, color, price)
# A tibble: 53,940 × 4
   carat cut       color price
   <dbl> <ord>     <ord> <int>
 1  0.23 Ideal     E       326
 2  0.21 Premium   E       326
 3  0.23 Good      E       327
 4  0.29 Premium   I       334
 5  0.31 Good      J       335
 6  0.24 Very Good J       336
 7  0.24 Very Good I       336
 8  0.26 Very Good H       337
 9  0.22 Fair      E       337
10  0.23 Very Good H       338
# ℹ 53,930 more rows
  • Alternatively, select a range of adjacent columns with ":"
diamonds %>% select(carat : color, price)
# A tibble: 53,940 × 4
   carat cut       color price
   <dbl> <ord>     <ord> <int>
 1  0.23 Ideal     E       326
 2  0.21 Premium   E       326
 3  0.23 Good      E       327
 4  0.29 Premium   I       334
 5  0.31 Good      J       335
 6  0.24 Very Good J       336
 7  0.24 Very Good I       336
 8  0.26 Very Good H       337
 9  0.22 Fair      E       337
10  0.23 Very Good H       338
# ℹ 53,930 more rows
# selecting columns starting with 'c'
diamonds %>% select(starts_with("c"))
# A tibble: 53,940 × 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1  0.23 Ideal     E     SI2    
 2  0.21 Premium   E     SI1    
 3  0.23 Good      E     VS1    
 4  0.29 Premium   I     VS2    
 5  0.31 Good      J     SI2    
 6  0.24 Very Good J     VVS2   
 7  0.24 Very Good I     VVS1   
 8  0.26 Very Good H     SI1    
 9  0.22 Fair      E     VS2    
10  0.23 Very Good H     VS1    
# ℹ 53,930 more rows
# select all the columns except carat
diamonds %>% select(-carat)
# A tibble: 53,940 × 9
   cut       color clarity depth table price     x     y     z
   <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
  • Exercise: select all the columns except those whose names start with "c".
diamonds %>% select(- starts_with("c"))
# A tibble: 53,940 × 6
   depth table price     x     y     z
   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  61.5    55   326  3.95  3.98  2.43
 2  59.8    61   326  3.89  3.84  2.31
 3  56.9    65   327  4.05  4.07  2.31
 4  62.4    58   334  4.2   4.23  2.63
 5  63.3    58   335  4.34  4.35  2.75
 6  62.8    57   336  3.94  3.96  2.48
 7  62.3    57   336  3.95  3.98  2.47
 8  61.9    55   337  4.07  4.11  2.53
 9  65.1    61   337  3.87  3.78  2.49
10  59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
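Beyond the minus sign, tidyselect offers other ways to express a selection. The sketch below is an illustrative addition, assuming dplyr 1.0.0 or later and ggplot2 are installed: it shows the "!" negation operator and the where() helper. The object names no_c and numeric_cols are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# "!" takes the complement of a selection
no_c <- diamonds %>% select(!starts_with("c"))
names(no_c)

# where() keeps the columns for which a predicate function returns TRUE
numeric_cols <- diamonds %>% select(where(is.numeric))
names(numeric_cols)
```

The ordered factors cut, color and clarity are not numeric, so where(is.numeric) leaves them out.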

filter

diamonds %>% filter(cut == "Premium")
# A tibble: 13,791 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
 2  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
 3  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
 4  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
 5  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 6  0.24 Premium I     VS1      62.5    57   355  3.97  3.94  2.47
 7  0.29 Premium F     SI1      62.4    58   403  4.24  4.26  2.65
 8  0.22 Premium E     VS2      61.6    58   404  3.93  3.89  2.41
 9  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
10  0.3  Premium J     SI2      59.3    61   405  4.43  4.38  2.61
# ℹ 13,781 more rows
# combine several conditions with '&' or by separating them with a comma
diamonds %>% filter(cut == "Premium" & color == "D") 
# A tibble: 1,603 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
 2  0.3  Premium D     SI1      62.6    59   552  4.23  4.27  2.66
 3  0.71 Premium D     SI2      61.7    59  2768  5.71  5.67  3.51
 4  0.71 Premium D     VS2      62.5    60  2770  5.65  5.61  3.52
 5  0.7  Premium D     VS2      58      62  2773  5.87  5.78  3.38
 6  0.72 Premium D     SI1      62.7    59  2782  5.73  5.69  3.58
 7  0.7  Premium D     SI1      62.8    60  2782  5.68  5.66  3.56
 8  0.72 Premium D     SI2      62      60  2795  5.73  5.69  3.54
 9  0.71 Premium D     SI1      62.7    60  2797  5.67  5.71  3.57
10  0.71 Premium D     SI1      61.3    58  2797  5.73  5.75  3.52
# ℹ 1,593 more rows
#The output of a selection can be saved in a new data frame
myselection = diamonds %>% filter(between(price, 500,600))
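The following sketch, an illustrative addition assuming dplyr and ggplot2 are installed, checks that "&" and a comma give the same rows, shows %in% for matching several values, and spells out between() as two explicit comparisons. All object names are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# "&" and a comma both mean "all conditions must hold"
with_and   <- diamonds %>% filter(cut == "Premium" & color == "D")
with_comma <- diamonds %>% filter(cut == "Premium", color == "D")
nrow(with_and) == nrow(with_comma)

# %in% keeps rows whose value matches any element of a vector
premium_or_ideal <- diamonds %>% filter(cut %in% c("Premium", "Ideal"))

# between() includes both endpoints
by_between <- diamonds %>% filter(between(price, 500, 600))
by_range   <- diamonds %>% filter(price >= 500, price <= 600)
nrow(by_between) == nrow(by_range)
```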

summarise

#mean and median of price.

diamonds %>% 
  summarise(mean(price),
            median(price))
# A tibble: 1 × 2
  `mean(price)` `median(price)`
          <dbl>           <dbl>
1         3933.            2401
# number, proportion and percentage of diamonds with a price above $15,000
diamonds %>% 
  summarise(veryexp = sum(price > 15000),
            veryexpprop = mean(price>15000),
            veryexpperc = mean(price>15000)*100)
# A tibble: 1 × 3
  veryexp veryexpprop veryexpperc
    <int>       <dbl>       <dbl>
1    1655      0.0307        3.07
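Unnamed summaries produce column names such as `mean(price)`, which need backticks to be referenced later. A minimal sketch of the same mean and median with explicit names, plus a row count from n(), is given here as an illustrative addition; the column names n_diamonds, mean_price and median_price are assumptions chosen for readability.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

summary_tbl <- diamonds %>%
  summarise(
    n_diamonds   = n(),           # number of rows
    mean_price   = mean(price),
    median_price = median(price)
  )

summary_tbl
```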

group_by

  • group_by() is used to group rows of a data frame by one or more columns.

  • This helps in performing operations like summarizing or aggregating data by categories.

diamonds %>% 
  group_by(cut,color) %>% 
  summarise(mean(price)) 
# A tibble: 35 × 3
# Groups:   cut [5]
   cut   color `mean(price)`
   <ord> <ord>         <dbl>
 1 Fair  D             4291.
 2 Fair  E             3682.
 3 Fair  F             3827.
 4 Fair  G             4239.
 5 Fair  H             5136.
 6 Fair  I             4685.
 7 Fair  J             4976.
 8 Good  D             3405.
 9 Good  E             3424.
10 Good  F             3496.
# ℹ 25 more rows
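The "Groups: cut [5]" line in the output shows that the result is still grouped by cut: by default summarise() removes only the last grouping variable. The sketch below, an illustrative addition assuming dplyr 1.0.0 or later, names the summary column and drops all grouping with the .groups argument. The names by_cut_color and mean_price are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

by_cut_color <- diamonds %>%
  group_by(cut, color) %>%
  summarise(mean_price = mean(price), .groups = "drop")

# The result is a plain (ungrouped) tibble with one row per cut/color pair
is_grouped_df(by_cut_color)
nrow(by_cut_color)
```

Dropping the groups avoids surprises when further verbs are chained after the summary.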

mutate

  • mutate() is used to create new columns or modify existing columns in a data frame.
# Create a new column that flags each diamond as "Yes" if its price is less than $1000, and "No" otherwise.
newdiamonds = diamonds %>% 
  mutate(newcol = ifelse(price < 1000, "Yes", "No"))
# Derive the frequency distribution of 'newcol' along with percentages.
newdiamonds %>% 
  count(newcol) %>% 
  #summarise(perc=n/nrow(newdiamonds)*100)
  mutate(perc=n/nrow(newdiamonds)*100)
# A tibble: 2 × 3
  newcol     n  perc
  <chr>  <int> <dbl>
1 No     39441  73.1
2 Yes    14499  26.9
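A possible variation on the same computation, given as a sketch rather than the slides' own method: dplyr's if_else() is a stricter alternative to ifelse(), and dividing by sum(n) avoids referring back to the original data frame. The names price_flag and cheap are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

price_flag <- diamonds %>%
  mutate(cheap = if_else(price < 1000, "Yes", "No")) %>%  # type-strict ifelse
  count(cheap) %>%                                        # frequency of each level
  mutate(perc = n / sum(n) * 100)                         # percentages within the pipeline

price_flag
```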

arrange

# sort diamonds by price (ascending) and show the last rows, i.e. the most expensive ones
diamonds %>% 
  arrange(price) %>% 
  tail
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
2  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
3  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
4  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
5  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
6  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
# sort diamonds by price in descending order
diamonds %>% 
  arrange(desc(price))
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
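arrange() also accepts several sorting keys, and dplyr (1.0.0 or later) provides slice_max() as a shortcut for "sort descending and keep the top rows". The sketch below is an illustrative addition; the object names are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Three most expensive diamonds, two equivalent ways
top3_sorted <- diamonds %>% arrange(desc(price)) %>% head(3)
top3_sliced <- diamonds %>% slice_max(price, n = 3)

# Sort by cut (in factor-level order), then by decreasing price within each cut
sorted_two_keys <- diamonds %>% arrange(cut, desc(price))

top3_sorted$price
top3_sliced$price
```

Because cut is an ordered factor, sorting on it follows the level order (Fair first), not alphabetical order.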

Example with dplyr verbs combined

  • Starting from ‘mtcars’, we keep the rows where mpg > 20, select specific columns, create a new column, and arrange the data by mpg in descending order.
library(dplyr)

mtcars %>% 
  filter(mpg > 20) %>%
  select(mpg, wt, hp) %>% 
  mutate(mpg_per_weight = mpg / wt) %>% 
  arrange(desc(mpg))
                mpg    wt  hp mpg_per_weight
Toyota Corolla 33.9 1.835  65      18.474114
Fiat 128       32.4 2.200  66      14.727273
Honda Civic    30.4 1.615  52      18.823529
Lotus Europa   30.4 1.513 113      20.092531
Fiat X1-9      27.3 1.935  66      14.108527
Porsche 914-2  26.0 2.140  91      12.149533
Merc 240D      24.4 3.190  62       7.648903
Datsun 710     22.8 2.320  93       9.827586
Merc 230       22.8 3.150  95       7.238095
Toyota Corona  21.5 2.465  97       8.722110
Hornet 4 Drive 21.4 3.215 110       6.656299
Volvo 142E     21.4 2.780 109       7.697842
Mazda RX4      21.0 2.620 110       8.015267
Mazda RX4 Wag  21.0 2.875 110       7.304348
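In mtcars the car names are stored as row names rather than as a column, so they cannot be used inside dplyr verbs. A possible variation of the pipeline, sketched here as an illustrative addition, first moves them into a regular column with tibble::rownames_to_column(). The column name model and the object name result are assumptions.

```r
library(dplyr)
library(tibble)

result <- mtcars %>%
  rownames_to_column(var = "model") %>%   # car names become an ordinary column
  filter(mpg > 20) %>%
  select(model, mpg, wt, hp) %>%
  mutate(mpg_per_weight = mpg / wt) %>%
  arrange(desc(mpg))

head(result, 3)
```

With the names in a column, they can be filtered, selected or used for grouping like any other variable.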

Thanks